15 research outputs found
Future Frame Prediction for Anomaly Detection -- A New Baseline
Anomaly detection in videos refers to the identification of events that do
not conform to expected behavior. However, almost all existing methods tackle
the problem by minimizing the reconstruction errors of training data, which
cannot guarantee a larger reconstruction error for an abnormal event. In this
paper, we propose to tackle the anomaly detection problem within a video
prediction framework. To the best of our knowledge, this is the first work that
leverages the difference between a predicted future frame and its ground truth
to detect an abnormal event. To predict a future frame with higher quality for
normal events, in addition to the commonly used appearance (spatial) constraints on
intensity and gradient, we also introduce a motion (temporal) constraint in
video prediction by enforcing the optical flow between predicted frames and
ground truth frames to be consistent, and this is the first work that
introduces a temporal constraint into the video prediction task. Such spatial
and motion constraints facilitate the future frame prediction for normal
events, and consequently facilitate the identification of abnormal events that
do not conform to the expectation. Extensive experiments on both a toy dataset and
some publicly available datasets validate the effectiveness of our method in
terms of robustness to the uncertainty in normal events and the sensitivity to
abnormal events.
Comment: IEEE Conference on Computer Vision and Pattern Recognition 2018
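As a concrete illustration of the scoring step, the sketch below computes a PSNR between each predicted frame and its ground truth and min-max normalizes the values per video, so that low scores flag likely anomalies. This is a minimal reading of the abstract, not the authors' released code; tensor shapes and the intensity range are assumptions.

```python
# Minimal sketch: frame-prediction anomaly scoring via PSNR (assumptions noted above).
import torch

def psnr(pred: torch.Tensor, gt: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio between a predicted frame and its ground truth."""
    mse = torch.mean((pred - gt) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

def regularity_scores(preds: torch.Tensor, gts: torch.Tensor) -> torch.Tensor:
    """Per-frame scores min-max normalized to [0, 1]; lower means more anomalous."""
    scores = torch.stack([psnr(p, g) for p, g in zip(preds, gts)])
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)

# Example: 16 predicted vs. ground-truth frames (3x256x256, intensities in [0, 1]).
preds = torch.rand(16, 3, 256, 256)
gts = torch.rand(16, 3, 256, 256)
print(regularity_scores(preds, gts))
```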
Scaling & Shifting Your Features: A New Baseline for Efficient Model Tuning
Existing fine-tuning methods either tune all parameters of the pre-trained
model (full fine-tuning), which is not efficient, or only tune the last linear
layer (linear probing), which suffers a significant accuracy drop compared to
the full fine-tuning. In this paper, we propose a new parameter-efficient
fine-tuning method termed SSF, in which one only needs to Scale and Shift the
deep Features extracted by a pre-trained model to match the performance of full
fine-tuning. Surprisingly, SSF also
outperforms other parameter-efficient fine-tuning approaches even with a
smaller number of tunable parameters. Furthermore, different from some existing
parameter-efficient fine-tuning methods (e.g., Adapter or VPT) that introduce
extra parameters and computational cost in both the training and inference
stages, SSF only adds learnable parameters during the training stage, and these
additional parameters can be merged into the original pre-trained model weights
via re-parameterization in the inference phase. With the proposed SSF, our
model obtains Top-1 accuracy improvements of 2.46% (90.72% vs. 88.54%) on FGVC
and 11.48% (73.10% vs. 65.57%) on VTAB-1k compared to full fine-tuning, while
tuning only about 0.3M parameters. We also
conduct extensive experiments across various model families (CNNs, Transformers,
and MLPs) and datasets. Results on 26 image classification datasets in total
and 3 robustness & out-of-distribution datasets show the effectiveness of SSF.
Code is available at https://github.com/dongzelian/SSF.
Comment: Accepted by NeurIPS 2022
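The scale-and-shift operation and its inference-time merge can be sketched as below, assuming SSF follows a linear layer: since gamma * (Wx + b) + beta = (gamma ⊙ W)x + (gamma ⊙ b + beta), the extra parameters fold exactly into the frozen weights. Module and function names are illustrative, not the released implementation.

```python
# Sketch of SSF (scale + shift on features) and its re-parameterization merge.
import torch
import torch.nn as nn

class SSF(nn.Module):
    """Learnable per-channel scale (gamma) and shift (beta) applied to features."""
    def __init__(self, dim: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gamma + self.beta

def merge_into_linear(linear: nn.Linear, ssf: SSF) -> nn.Linear:
    """Fold SSF into the preceding frozen linear layer for inference:
    gamma * (Wx + b) + beta == (gamma * W) x + (gamma * b + beta)."""
    merged = nn.Linear(linear.in_features, linear.out_features)
    with torch.no_grad():
        merged.weight.copy_(ssf.gamma.unsqueeze(1) * linear.weight)
        merged.bias.copy_(ssf.gamma * linear.bias + ssf.beta)
    return merged

# Sanity check: the merged layer matches linear + SSF on random input.
lin, ssf = nn.Linear(8, 4), SSF(4)
x = torch.randn(2, 8)
assert torch.allclose(merge_into_linear(lin, ssf)(x), ssf(lin(x)), atol=1e-5)
```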
Priority-Centric Human Motion Generation in Discrete Latent Space
Text-to-motion generation is a formidable task, aiming to produce human
motions that align with the input text while also adhering to human
capabilities and physical laws. While there have been advancements in diffusion
models, their application in discrete spaces remains underexplored. Current
methods often overlook the varying significance of different motions, treating
them uniformly. It is essential to recognize that not all motions hold the same
relevance to a particular textual description. Some motions, being more salient
and informative, should be given precedence during generation. In response, we
introduce a Priority-Centric Motion Discrete Diffusion Model (M2DM), which
utilizes a Transformer-based VQ-VAE to derive a concise, discrete motion
representation, incorporating a global self-attention mechanism and a
regularization term to counteract code collapse. We also present a motion
discrete diffusion model that employs an innovative noise schedule, determined
by the significance of each motion token within the entire motion sequence.
This approach retains the most salient motions during the reverse diffusion
process, leading to more semantically rich and varied motions. Additionally, we
formulate two strategies to gauge the importance of motion tokens, drawing from
both textual and visual indicators. Comprehensive experiments on the HumanML3D
and KIT-ML datasets confirm that our model surpasses existing techniques in
fidelity and diversity, particularly for intricate textual descriptions.
Comment: Accepted by ICCV 2023
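A hedged sketch of such an importance-aware corruption schedule follows: at corruption fraction t, the least-important tokens are masked first, so the most salient motion tokens survive longest and are recovered first in the reverse process. The mask id and the random importance scores are illustrative assumptions, not M2DM's exact formulation.

```python
# Sketch: priority-aware masking for discrete diffusion over motion tokens.
import torch

MASK_ID = -1  # placeholder id for a corrupted token (assumption)

def corrupt(tokens: torch.Tensor, importance: torch.Tensor, t: float) -> torch.Tensor:
    """Mask the least-important fraction t (0..1) of motion tokens."""
    n_mask = int(t * tokens.numel())
    order = torch.argsort(importance)  # ascending: least salient tokens first
    corrupted = tokens.clone()
    corrupted[order[:n_mask]] = MASK_ID
    return corrupted

tokens = torch.arange(10)              # a toy motion-token sequence
importance = torch.rand(10)            # e.g., scores from textual/visual cues
print(corrupt(tokens, importance, t=0.5))  # half the tokens masked
```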
GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph
Adapter-style efficient transfer learning (ETL) has shown excellent
performance in the tuning of vision-language models (VLMs) under the low-data
regime, where only a few additional parameters are introduced to excavate the
task-specific knowledge based on the general and powerful representation of
VLMs. However, most adapter-style works face two limitations: (i) modeling
task-specific knowledge with a single modality only; and (ii) overlooking the
exploitation of the inter-class relationships in downstream tasks, thereby
leading to sub-optimal solutions. To mitigate these limitations, we propose an effective
adapter-style tuning strategy, dubbed GraphAdapter, which builds a textual
adapter by explicitly modeling the dual-modality structure knowledge (i.e., the
correlation of different semantics/classes in textual and visual modalities)
with a dual knowledge graph. In particular, the dual knowledge graph is
established with two sub-graphs, i.e., a textual knowledge sub-graph, and a
visual knowledge sub-graph, where the nodes and edges represent the
semantics/classes and their correlations in two modalities, respectively. This
enables the textual feature of each prompt to leverage the task-specific
structure knowledge from both textual and visual modalities, yielding a more
effective classifier for downstream tasks. Extensive experimental results on 11
benchmark datasets reveal that our GraphAdapter significantly outperforms
previous adapter-based methods. The code will be released at
https://github.com/lixinustc/GraphAdapter.
Comment: Accepted by NeurIPS 2023. The manuscript will be further revised based
on the reviews.
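One plausible reading of the dual-graph mechanism, sketched under stated assumptions: class node features from each modality define a similarity-weighted sub-graph, one propagation step injects that structure into the textual class features, and the result is blended residually with the original prompt feature. The graph construction and blending weight alpha are illustrative, not the released GraphAdapter code.

```python
# Sketch: refining textual class features with a dual (textual + visual) graph.
import torch
import torch.nn.functional as F

def propagate(nodes: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
    """One step of similarity-weighted propagation: edge weights are softmaxed
    cosine similarities between the class node features of one modality."""
    sims = F.normalize(nodes, dim=-1) @ F.normalize(nodes, dim=-1).T
    return F.softmax(sims, dim=-1) @ feats

def graph_adapter(text_feat, text_nodes, visual_nodes, alpha: float = 0.7):
    """Blend the prompt feature with structure knowledge from both sub-graphs."""
    refined = 0.5 * (propagate(text_nodes, text_feat) +
                     propagate(visual_nodes, text_feat))
    # residual blend keeps the general VLM representation dominant
    return alpha * text_feat + (1 - alpha) * refined

C, D = 11, 512                                  # classes, embedding dim (assumed)
text_feat = torch.randn(C, D)                   # per-class prompt features
out = graph_adapter(text_feat, torch.randn(C, D), torch.randn(C, D))
print(out.shape)                                # torch.Size([11, 512])
```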
Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos
Sequential video understanding, as an emerging video understanding task, has
attracted considerable attention from researchers because of its goal-oriented nature. This
paper studies weakly supervised sequential video understanding where the
accurate time-stamp level text-video alignment is not provided. We solve this
task by borrowing ideas from CLIP. Specifically, we use a transformer to
aggregate frame-level features for video representation and use a pre-trained
text encoder to encode the texts corresponding to each action and the whole
video, respectively. To model the correspondence between text and video, we
propose a multiple granularity loss, where the video-paragraph contrastive loss
enforces matching between the whole video and the complete script, and a
fine-grained frame-sentence contrastive loss enforces the matching between each
action and its description. As the frame-sentence correspondence is not
available, we propose to use the fact that video actions happen sequentially in
the temporal domain to generate pseudo frame-sentence correspondence and
supervise the network training with the pseudo labels. Extensive experiments on
video sequence verification and text-to-video matching show that our method
outperforms baselines by a large margin, which validates the effectiveness of
our proposed approach. Code is available at https://github.com/svip-lab/WeakSVR.
Comment: CVPR 2023. Code: https://github.com/svip-lab/WeakSVR
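A minimal sketch of the multiple granularity loss under stated assumptions: a symmetric InfoNCE term matches whole-video and paragraph embeddings across a batch, and a fine-grained term matches frames to sentences using pseudo labels derived from the sequential order (here, frames are split evenly across sentences, an illustrative stand-in for the paper's pseudo-alignment).

```python
# Sketch: video-paragraph contrastive loss + frame-sentence loss with pseudo labels.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss between two aligned sets of embeddings."""
    logits = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T / tau
    targets = torch.arange(len(a))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

def pseudo_alignment(n_frames: int, n_sentences: int) -> torch.Tensor:
    """Assign frames to sentences in temporal order (pseudo frame-sentence labels)."""
    return torch.linspace(0, n_sentences - 1e-6, n_frames).long()

video_emb, para_emb = torch.randn(4, 512), torch.randn(4, 512)   # batch of 4 videos
frame_emb, sent_emb = torch.randn(32, 512), torch.randn(5, 512)  # one video's frames
assign = pseudo_alignment(32, 5)
coarse = info_nce(video_emb, para_emb)
fine_logits = F.normalize(frame_emb, dim=-1) @ F.normalize(sent_emb, dim=-1).T / 0.07
loss = coarse + F.cross_entropy(fine_logits, assign)
print(loss.item())
```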
TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration
We propose a novel task for generating 3D dance movements that simultaneously
incorporate both text and music modalities. Unlike existing works that generate
dance movements using a single modality such as music, our goal is to produce
richer dance movements guided by the instructive information provided by the
text. However, the lack of paired motion data with both music and text
modalities limits the ability to generate dance movements that integrate both.
To alleviate this challenge, we propose to utilize a 3D human motion VQ-VAE to
project the motions of the two datasets into a latent space consisting of
quantized vectors, which effectively mixes the motion tokens from the two
datasets, despite their different distributions, for training. Additionally, we
propose a cross-modal transformer to integrate text instructions into the
motion generation architecture, generating 3D dance movements without degrading the
performance of music-conditioned dance generation. To better evaluate the
quality of the generated motion, we introduce two novel metrics, namely Motion
Prediction Distance (MPD) and Freezing Score, to measure the coherence and
freezing percentage of the generated motion. Extensive experiments show that
our approach can generate realistic and coherent dance movements conditioned on
both text and music, while maintaining performance comparable to the two
single-modality settings. Code will be available at:
https://garfield-kh.github.io/TM2D/
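The shared discrete latent space can be sketched as plain VQ quantization: a single codebook maps motion features from either dataset to the same token vocabulary, so tokens from the music-paired and text-paired data can be mixed for training. The codebook size and feature dimension below are assumptions.

```python
# Sketch: one VQ codebook quantizes motions from both datasets into shared tokens.
import torch

def quantize(feats: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each motion feature to the index of its nearest codebook vector."""
    dists = torch.cdist(feats, codebook)        # (T, K) pairwise distances
    return dists.argmin(dim=-1)                 # (T,) discrete motion tokens

codebook = torch.randn(512, 256)                # K=512 codes, 256-d latents (assumed)
dance_feats = torch.randn(60, 256)              # from the music-paired dataset
action_feats = torch.randn(40, 256)             # from the text-paired dataset
mixed_tokens = torch.cat([quantize(dance_feats, codebook),
                          quantize(action_feats, codebook)])
print(mixed_tokens.shape)                       # torch.Size([100])
```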
Dataset Quantization
State-of-the-art deep neural networks are trained with large amounts of data
(millions or even billions of samples). The expensive computation and memory
costs make it difficult to train them on limited hardware resources, especially
for recent popular large language models (LLMs) and computer vision (CV) models.
Dataset distillation methods have thus been developed, aiming to reduce the
number of training samples by synthesizing small-scale datasets via gradient
matching. However, as the gradient calculation is coupled with the
specific network architecture, the synthesized dataset is biased and performs
poorly when used for training unseen architectures. To address these
limitations, we present dataset quantization (DQ), a new framework to compress
large-scale datasets into small subsets that can be used to train any neural
network architecture. Extensive experiments demonstrate that DQ is able
to generate condensed small datasets for training unseen network architectures
with state-of-the-art compression ratios for lossless model training. To the
best of our knowledge, DQ is the first method that can successfully distill
large-scale datasets such as ImageNet-1k with a state-of-the-art compression
ratio. Notably, with 60% of the ImageNet data and 20% of Alpaca's
instruction-tuning data, models can be trained with negligible or no
performance drop on both vision tasks (including classification, semantic
segmentation, and object detection) and language tasks (including
instruction-tuning tasks such as BBH and DROP).
Comment: 9 pages
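A hedged sketch of the bin-and-sample idea: partition the dataset into non-overlapping bins and draw a fixed fraction uniformly from each bin, so the kept subset spans the whole distribution. The distance-to-centroid binning below is an illustrative proxy for the paper's selection criterion, not the released DQ pipeline.

```python
# Sketch: partition samples into bins, then sample uniformly from every bin.
import torch

def quantize_dataset(embeds: torch.Tensor, n_bins: int, keep_frac: float) -> torch.Tensor:
    """Return indices of the selected subset."""
    dists = (embeds - embeds.mean(dim=0)).norm(dim=-1)  # distance to centroid
    order = dists.argsort()                             # spread samples by distance
    bins = order.chunk(n_bins)                          # non-overlapping bins
    keep = []
    for b in bins:
        n_keep = max(1, int(keep_frac * len(b)))
        keep.append(b[torch.randperm(len(b))[:n_keep]])  # uniform draw per bin
    return torch.cat(keep)

embeds = torch.randn(1000, 128)                          # e.g., encoder features
subset = quantize_dataset(embeds, n_bins=10, keep_frac=0.6)
print(subset.shape)                                      # ~600 selected samples
```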